Optimal Code Length Based Cost for Unsupervised Grammar Induction
Abstract
An effective grammar can be induced from natural language sentences by simultaneously minimizing the cost of encoding the grammar and the cost of encoding the corresponding derivations of the sentences. In previous work, the cost of encoding a derivation was computed as the number of bits required to encode which of the possible productions is used to expand each of its non-terminals. However, this ignored the fact that if some productions are used more often than others in the derivations, they can be encoded with fewer bits using an optimal code length based encoding. This paper presents a new derivation cost that uses such an encoding and applies it to grammar induction. Minimizing this new derivation cost also corresponds to maximizing the probability of the derivations. Thus, besides being theoretically more appealing, the new derivation cost also leads to the induction of grammars with better parsing performance, as shown by experimental results on sentences from clinical reports.
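The contrast between the two derivation costs can be illustrated with a minimal sketch. It is not the paper's implementation: the toy grammar, the example derivation, the function names `uniform_cost` and `optimal_code_cost`, and the choice to estimate each production's probability as its relative frequency among productions sharing the same left-hand side are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy grammar: each non-terminal maps to its possible right-hand sides.
grammar = {
    "S": [("NP", "VP")],
    "NP": [("noun",), ("det", "noun"), ("pronoun",), ("NP", "PP")],
}

# Hypothetical pooled derivation: the productions used, as (lhs, rhs) pairs.
derivation = (
    [("S", ("NP", "VP"))] * 4
    + [("NP", ("noun",))] * 6
    + [("NP", ("det", "noun"))] * 2
)

def uniform_cost(derivation, grammar):
    """Bits needed when every expansion of a non-terminal with n
    possible productions costs log2(n) bits, regardless of usage."""
    return sum(math.log2(len(grammar[lhs])) for lhs, _ in derivation)

def optimal_code_cost(derivation):
    """Bits needed when each production costs -log2(p) bits, where p is
    (by assumption) its relative frequency among the productions with
    the same left-hand side in the derivation."""
    rule_counts = Counter(derivation)
    lhs_counts = Counter(lhs for lhs, _ in derivation)
    return -sum(
        count * math.log2(count / lhs_counts[lhs])
        for (lhs, _), count in rule_counts.items()
    )

print(f"uniform cost:      {uniform_cost(derivation, grammar):.2f} bits")
print(f"optimal-code cost: {optimal_code_cost(derivation):.2f} bits")
```

Because the optimal-code cost is a sum of negative log probabilities, minimizing it is the same as maximizing the product of the production probabilities, which matches the abstract's remark about maximizing the probability of the derivations. In this toy case the frequently used production `NP -> noun` becomes cheap to encode, so the total drops well below the uniform cost.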
Similar resources
On Induction of Morphology Grammars and its Role in Bootstrapping
Different Alignment Based Learning (ABL) algorithms have been proposed for unsupervised grammar induction, e.g., Zaanen (2001) and Déjean (1998), in particular for the induction of syntactic rules. However, ABL seems to be better suited for the induction of morphological rules. In this paper we show how unsupervised hypothesis generation with ABL algorithms can be used to induce a lexicon and m...
Agent-Based Unsupervised Grammar Induction
In this paper, we describe an agent-based evolutionary computing approach to unsupervised grammar induction called grael (Grammar Evolution). Extending a general framework for data driven grammar optimization and induction, the evolutionary setup of grael can be used to automatically induce and optimize grammars from scratch on the basis of unstructured text. Agents are equipped with a very bas...
Iterative Rule Segmentation under Minimum Description Length for Unsupervised Transduction Grammar Induction
We argue that for purely incremental unsupervised learning of phrasal inversion transduction grammars, a minimum description length driven, iterative top-down rule segmentation approach that is the polar opposite of Saers, Addanki, and Wu’s previous 2012 bottom-up iterative rule chunking model yields significantly better translation accuracy and grammar parsimony. We still aim for unsupervised ...
Unsupervised induction of stochastic context-free grammars using distributional clustering
An algorithm is presented for learning a phrase-structure grammar from tagged text. It clusters sequences of tags together based on local distributional information, and selects clusters that satisfy a novel mutual information criterion. This criterion is shown to be related to the entropy of a random variable associated with the tree structures, and it is demonstrated that it selects linguisti...
Unsupervised Learning of Bilingual Categories in Inversion Transduction Grammar Induction
We present the first known experiments incorporating unsupervised bilingual nonterminal category learning within end-to-end fully unsupervised transduction grammar induction using matched training and testing models. Despite steady recent progress, such induction experiments until now have not allowed for learning differentiated nonterminal categories. We divide the learning into two stages: (1...